2019-03-18 | Reinforcement Learning | UNLOCK

David Silver's Reinforcement Learning Course: Introduction

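How reinforcement learning differs

  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters: the data is sequential, not i.i.d.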

The RL problem

Rewards

A reward is a scalar feedback signal.

The goal is to maximise cumulative reward in as little time as possible.

State

  • agent state
  • environment state
  • Full observability
  • partial observability

The three components of an agent

  • Policy: the agent’s behaviour function

    Deterministic policy: a map from state to action, $a=\pi(s)$

    Stochastic policy: $\pi(a|s)=P[A_t=a|S_t=s]$

  • Value function: how good each state and/or action is

  • Model: the agent’s representation of the environment
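The two policy forms can be illustrated with a minimal Python sketch. The state names (`s0`, `s1`), the action names, and the probabilities are hypothetical values chosen for illustration; they do not come from the course.

```python
import random

# Hypothetical toy actions, for illustration only.
ACTIONS = ["left", "right"]

# Deterministic policy a = pi(s): each state maps to exactly one action.
DETERMINISTIC_TABLE = {"s0": "right", "s1": "left"}

# Stochastic policy pi(a|s): each state maps to a distribution over actions.
STOCHASTIC_TABLE = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def deterministic_policy(state):
    """Return the single action pi(s) for the given state."""
    return DETERMINISTIC_TABLE[state]

def stochastic_policy(state, rng=random):
    """Sample an action A_t from pi(. | S_t = state)."""
    probs = STOCHASTIC_TABLE[state]
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Calling `deterministic_policy("s0")` always gives the same action, while repeated calls to `stochastic_policy("s0")` return `"right"` about 80% of the time under the assumed table.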

Exploration and exploitation

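Prediction and control

  • Prediction: given a policy, evaluate the future by computing its value function
  • Control: find the best policy, the one that maximises future reward, by updating the policy while computing the value function until the policy is optimal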
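As a rough sketch of the difference, the code below runs both tasks on a tiny deterministic MDP. The two-state MDP, its rewards, the discount factor `GAMMA`, and the function names are all assumptions made for illustration. Prediction is done here with iterative policy evaluation, and control with policy iteration (evaluate, then improve greedily, until the policy stops changing); these are one possible choice of algorithms, not the only one.

```python
# Hypothetical 2-state MDP: P[state][action] = (next_state, reward).
P = {
    0: {"stay": (0, 0.0), "go": (1, 1.0)},
    1: {"stay": (1, 2.0), "go": (0, 0.0)},
}
GAMMA = 0.9  # discount factor (an assumption, not from the notes)

def evaluate(policy, sweeps=500):
    """Prediction: estimate v_pi for a fixed policy by repeated Bellman backups."""
    v = {s: 0.0 for s in P}
    for _ in range(sweeps):
        # Synchronous backup: the right-hand side reads the previous v.
        v = {s: P[s][policy[s]][1] + GAMMA * v[P[s][policy[s]][0]] for s in P}
    return v

def improve(v):
    """Return the policy that acts greedily with respect to v."""
    return {
        s: max(P[s], key=lambda a: P[s][a][1] + GAMMA * v[P[s][a][0]])
        for s in P
    }

def control(policy):
    """Control: alternate evaluation and improvement until the policy is stable."""
    while True:
        v = evaluate(policy)
        new_policy = improve(v)
        if new_policy == policy:
            return policy, v
        policy = new_policy
```

For example, `evaluate({0: "stay", 1: "go"})` only scores that fixed policy, whereas `control({0: "stay", 1: "go"})` changes the policy along the way and returns the improved policy together with its value function.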
